Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Community][fix] Fix Azure cosmos db no SQL similarity search with score and mmr #28479

Conversation

wassim-mechergui-shift
Copy link

@wassim-mechergui-shift wassim-mechergui-shift commented Dec 3, 2024

Issues:

Changes:

  • VectorStoreRetriever fix:
    • For the similarity search from VectorStoreRetriever a rather typical implementation was added (example for Qdrant)
  • with_embeddings to similarity search:
    • Fix for the with_embedding keyword) due to a change made here (Azure Cosmos DB _similarity_search_with_score fails with KeyError: 'metadata' #26097), item no longer has self._embedding_key and has "$1" (that represents it's order in the parameters). Since Cosmos DB’s SQL API does not support parameterized aliases or dynamic aliasing within the query, we chose to name it 'embeddingKey'
    • optimization : if with_embedding is not True, we don't need to query the embeddings. Querying the embeddings is heavy and makes requests much slower. This was changed
  • Changed initialisation of vector policy (from which we get the embedding key and the distance function)
    • if user specifies a dict, we check it has the correct form
    • if user specifies the container creation if it doesn't exist option, we check this is not null
    • for the final user vector policy property
      • if the container exposes one, we use that one (and raise a warning if the user specified policy has differences)
      • if the container doesn't expose one, we use the user exposed one (with an error in case nothing was specified)

These changes were tested locally.

Copy link

vercel bot commented Dec 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Dec 9, 2024 0:27am

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: vector store Related to vector store module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Dec 3, 2024
@@ -260,6 +262,28 @@ def delete_document_by_id(self, document_id: Optional[str] = None) -> None:
raise ValueError("No document ids provided to delete.")
self._container.delete_item(document_id, partition_key=document_id)

def _select_relevance_score_fn(self) -> Callable[[float], float]:
"""
The 'correct' relevance function

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wassim-mechergui-shift wassim-mechergui-shift changed the title Fix Azure cosmos db no SQL similarity search with score and mmr [Community][Bugfix] Fix Azure cosmos db no SQL similarity search with score and mmr Dec 6, 2024
@wassim-mechergui-shift wassim-mechergui-shift changed the title [Community][Bugfix] Fix Azure cosmos db no SQL similarity search with score and mmr [Community][fix] Fix Azure cosmos db no SQL similarity search with score and mmr Dec 6, 2024
@@ -121,6 +121,9 @@ def __init__(
self._embedding_key = self._vector_embedding_policy["vectorEmbeddings"][0][
"path"
][1:]
self._distance_strategy = self._vector_embedding_policy["vectorEmbeddings"][0][
"distanceFunction"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this guaranteed to always be specified? seems like we only check that self._vector_embedding_policy isn't empty if self._create_container is True

Copy link
Author

@wassim-mechergui-shift wassim-mechergui-shift Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure on what's the intended logic of this code, when I wrote it I followed the embedding_key up above.
But you are right. vector_embedding_policy in this case is a dict. Nothing enforces what this dict needs to be if create_container is not specified. I think there's a minor issue in the code here that can be addressed.

In essence, if we want to use an azure cosmos db no sql vector store, it must have the vector search feature enabled.
To enable and specify it, when we create the container, we must specify (among other things) a Vector Policy (the dict in question). microsoft link
Once created, it's possible to query for these info in the properties of the container. (through container.read()) (vector search may be activated but not exposed, see comment down below)

Additionally, once created, _database.create_container_if_not_exists (the method we use here to create our connection to the container) will not be overriden by the new specified policy (and it's totally valid also in this case to give it None policy).

So here are my suggested changes:

  • make vector_embedding_policy optional since we only force the checks if create container is true
  • self._vector_embedding_policy shouldn't be acquired from vector_embedding_policy but from the container itself self._vector_embedding_policy should be acquired from container if it is exposed otherwise from user specified options
  • optionally add a warning if there's a mismatch between container exposed vector_embedding_policy and used specified vector_embedding_policy

LMK if this is okay with you ! (I added these changes)

Copy link
Author

@wassim-mechergui-shift wassim-mechergui-shift Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I was testing and this assumption is not always valid

Once created, it's possible to query for these info in the properties of the container. (through container.read())

It seems there is a bug in azure cosmos db no sql but if the container was the result of a partition_key migration then it seems the information on the vector policy is lost (so not possible to read through container.read()). However, the container still fully supports vector search !

I'll readapt the implementation to account for this case. It seems no longer possible to use container.read() as the one source of truth.

I'm sorry for the long post I wanted to be thorooughly descriptive and argumentative in my implementation choices!

@@ -273,8 +336,12 @@ def _similarity_search_with_score(
if pre_filter is None or pre_filter.get("limit_offset_clause") is None:
query += "TOP @limit "

embedding_field = ""
if with_embedding:
embedding_field = "c[@embeddingKey] as embeddingKey, "
Copy link
Author

@wassim-mechergui-shift wassim-mechergui-shift Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a minor optimization but its impact can be substantial: embeddings needed only when the with_embedding argument is specified as True
Removing it from default because it adds a heavier weight on the query

When we ran our tests, even with chunk of size 4K tokens, the retrieval is still so much faster when we don't query for the embeddings (we go from 0.3s/chunk on a test to retrieve 100 chunks to ~0.02s/chunk)

Copy link
Collaborator

@ccurme ccurme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wassim-mechergui-shift it looks like there's some redundancy with #28716.

Closing for now but happy to re-open if there are differences in functionality and you'd like to merge with latest changes (ok to open another PR as well).

@ccurme ccurme closed this Dec 20, 2024
@wassim-mechergui-shift
Copy link
Author

Hi @ccurme ! Thanks for the comment. I didn't notice the other PR.

I will test when I get time from my side if the bugs I mentionned here were addressed in that PR (and hopefully no additional bugs were added). If so I'll merge here with the latest changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature community Related to langchain-community size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants